[Frontend] Add chunked processing to handle long inputs in embedding models #20837


Open
wants to merge 17 commits into
base: main


@x22x22 x22x22 commented Jul 11, 2025

…g, and update relevant documentation and examples. Add new example scripts and service startup scripts that demonstrate how to configure and use chunked processing. Update the model configuration to support long-text processing, and implement the chunked processing logic in the code.

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Add chunked processing support for long-text embeddings to resolve CUDA crashes that occur when the input text exceeds the model's maximum context length.

Problem Solved

  • CUDA crashes: vLLM embedding service crashes when processing text longer than max_model_len
  • Limited input length: No native support for handling arbitrarily long text in embedding models
  • Memory constraints: Large inputs cause out-of-memory errors during embedding generation

Solution

This PR implements automatic chunked processing at the serving layer that:

  • ✅ Automatically detects when input exceeds model limits
  • ✅ Splits long text into manageable chunks at token boundaries
  • ✅ Processes each chunk independently to avoid memory issues
  • ✅ Aggregates results using FastChat-style weighted averaging
  • ✅ Maintains backward compatibility for short text inputs
  • ✅ Requires zero changes to existing model implementations
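The detection and splitting steps listed above can be sketched as follows. This is a minimal illustration rather than the PR's actual code: the helper names `needs_chunking` and `split_into_chunks` are hypothetical, and the sketch assumes non-overlapping chunks cut purely by token count.

```python
from typing import List


def needs_chunking(token_ids: List[int], max_model_len: int) -> bool:
    """Return True when the tokenized input is longer than the model limit."""
    return len(token_ids) > max_model_len


def split_into_chunks(token_ids: List[int], max_chunk_len: int) -> List[List[int]]:
    """Split token ids into consecutive, non-overlapping chunks.

    Every chunk holds at most max_chunk_len tokens; only the last chunk
    may be shorter. Splitting happens on token boundaries, never inside
    a token, because the input is already tokenized.
    """
    if max_chunk_len <= 0:
        raise ValueError("max_chunk_len must be positive")
    return [
        token_ids[start:start + max_chunk_len]
        for start in range(0, len(token_ids), max_chunk_len)
    ]


# Illustrative input: 10 token ids and a limit of 4 tokens per chunk.
tokens = list(range(10))
chunks = split_into_chunks(tokens, 4)
print([len(chunk) for chunk in chunks])  # [4, 4, 2]
```

Each chunk can then be submitted as its own pooling request, which matches the `-chunk-0`, `-chunk-1`, ... request IDs visible in the server log in the test results.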

Key Features

  • Zero model code modification: All logic implemented in serving layer
  • Configurable: Enabled via enable_chunked_processing: true in pooler config
  • Smart aggregation: Token count-based weighted averaging preserves semantic quality
  • Production ready: Comprehensive error handling and logging
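As a rough sketch of how the option might be switched on, the snippet below builds a server launch command with the pooler config passed as JSON. The description only states that `enable_chunked_processing: true` belongs in the pooler config; passing it through a `--override-pooler-config` flag is an assumption here, and the exact flag and any additional fields may differ between vLLM versions.

```python
import json
import shlex

# Pooler settings as a JSON object. Only enable_chunked_processing comes
# from the PR description; how it is passed to the server is assumed.
pooler_config = {"enable_chunked_processing": True}

model = "intfloat/multilingual-e5-large"

# Assumption: the server accepts the pooler config via --override-pooler-config.
command = [
    "vllm",
    "serve",
    model,
    "--override-pooler-config",
    json.dumps(pooler_config),
]

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Serializing with `json.dumps` avoids hand-written quoting mistakes such as Python's `True` appearing where JSON expects `true`.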

Supported Models

  • intfloat/multilingual-e5-large (initially)
  • Extensible architecture for other embedding models

This enables vLLM to handle embedding requests of any length without crashes, significantly expanding its utility for RAG applications and long document processing.

Test Plan

Long Text Embedding with Chunked Processing

Test Result

Before modification

  • serve
ERROR 07-12 02:52:36 [engine.py:165] RuntimeError('CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, alpha_ptr, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, beta_ptr, c, CUDA_R_16F, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`')
ERROR 07-12 02:52:36 [engine.py:165] Traceback (most recent call last):
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/multiprocessing/engine.py", line 163, in start 
ERROR 07-12 02:52:36 [engine.py:165]     self.run_engine_loop()
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/multiprocessing/engine.py", line 226, in run_engine_loop
ERROR 07-12 02:52:36 [engine.py:165]     request_outputs = self.engine_step()
ERROR 07-12 02:52:36 [engine.py:165]                       ^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/multiprocessing/engine.py", line 252, in engine_step
ERROR 07-12 02:52:36 [engine.py:165]     raise e
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/multiprocessing/engine.py", line 235, in engine_step
ERROR 07-12 02:52:36 [engine.py:165]     return self.engine.step()
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/engine/llm_engine.py", line 1356, in step
ERROR 07-12 02:52:36 [engine.py:165]     outputs = self.model_executor.execute_model(
ERROR 07-12 02:52:36 [engine.py:165]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/executor/executor_base.py", line 141, in execute_model
ERROR 07-12 02:52:36 [engine.py:165]     output = self.collective_rpc("execute_model",
ERROR 07-12 02:52:36 [engine.py:165]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-12 02:52:36 [engine.py:165]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-12 02:52:36 [engine.py:165]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/utils/__init__.py", line 2943, in run_method
ERROR 07-12 02:52:36 [engine.py:165]     return func(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 07-12 02:52:36 [engine.py:165]     output = self.model_runner.execute_model(
ERROR 07-12 02:52:36 [engine.py:165]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-12 02:52:36 [engine.py:165]     return func(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/worker/pooling_model_runner.py", line 119, in execute_model
ERROR 07-12 02:52:36 [engine.py:165]     hidden_or_intermediate_states = model_executable(
ERROR 07-12 02:52:36 [engine.py:165]                                     ^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return self._call_impl(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return forward_call(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/model_executor/models/bert.py", line 415, in forward
ERROR 07-12 02:52:36 [engine.py:165]     return self.model(input_ids=input_ids,
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return self._call_impl(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return forward_call(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/model_executor/models/bert.py", line 350, in forward
ERROR 07-12 02:52:36 [engine.py:165]     return self.encoder(hidden_states)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/compilation/decorators.py", line 246, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     model_output = self.forward(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/model_executor/models/bert.py", line 114, in forward
ERROR 07-12 02:52:36 [engine.py:165]     def forward(
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return self._call_impl(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return forward_call(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 07-12 02:52:36 [engine.py:165]     return fn(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 07-12 02:52:36 [engine.py:165]     return self._wrapped_call(self, *args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     raise e
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return self._call_impl(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-12 02:52:36 [engine.py:165]     return forward_call(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "<eval_with_key>.2", line 294, in forward
ERROR 07-12 02:52:36 [engine.py:165]     submod_0 = self.submod_0(l_hidden_states_,...l_self_modules_layer_modules_23_modules_output_modules_layer_norm_parameters_bias_ = None
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/llm/vllm-250711/vllm/compilation/cuda_piecewise_backend.py", line 117, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     return self.compiled_graph_for_general_shape(*args)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 2143, in wrapper
ERROR 07-12 02:52:36 [engine.py:165]     return pytree.tree_unflatten(compiled_fn(*args, **kwargs), spec)
ERROR 07-12 02:52:36 [engine.py:165]                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 07-12 02:52:36 [engine.py:165]     return fn(*args, **kwargs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1201, in forward
ERROR 07-12 02:52:36 [engine.py:165]     return compiled_fn(full_args)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
ERROR 07-12 02:52:36 [engine.py:165]     all_outs = call_func_at_runtime_with_args(
ERROR 07-12 02:52:36 [engine.py:165]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
ERROR 07-12 02:52:36 [engine.py:165]     out = normalize_as_list(f(args))
ERROR 07-12 02:52:36 [engine.py:165]                             ^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
ERROR 07-12 02:52:36 [engine.py:165]     outs = compiled_fn(args)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
ERROR 07-12 02:52:36 [engine.py:165]     return compiled_fn(runtime_args)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_inductor/output_code.py", line 460, in __call__
ERROR 07-12 02:52:36 [engine.py:165]     return self.current_callable(inputs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/data/conda/envs/vllm-250411/lib/python3.11/site-packages/torch/_inductor/utils.py", line 2404, in run
ERROR 07-12 02:52:36 [engine.py:165]     return model(new_inputs)
ERROR 07-12 02:52:36 [engine.py:165]            ^^^^^^^^^^^^^^^^^
ERROR 07-12 02:52:36 [engine.py:165]   File "/hs_data/.cache/vllm/torch_compile_cache/12188d34d2/rank_0_0/inductor_cache/xq/cxqsnh7zlyb6wqrdkusizoacfp34wawoczfn2qrddhljgmde7x2e.py", line 520, in call
ERROR 07-12 02:52:36 [engine.py:165]     extern_kernels.mm(reinterpret_tensor(buf1, (s0, 1024), (1024, 1), 0), reinterpret_tensor(arg4_1, (1024, 1024), (1, 1024), 0), out=buf4)
ERROR 07-12 02:52:36 [engine.py:165] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, alpha_ptr, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, beta_ptr, c, CUDA_R_16F, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[rank0]:[W712 02:52:37.923419125 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2509407]

After modification

  • serve
INFO 07-12 02:20:40 [logger.py:43] Received request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-0: prompt: '', params: PoolingParams(dimensions=None, use_cross_encoder=False, additional_metadata=None), prompt_token_ids: [0, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 
7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-12 02:20:40 [logger.py:43] Received request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-1: prompt: '', params: PoolingParams(dimensions=None, use_cross_encoder=False, additional_metadata=None), prompt_token_ids: [7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 
214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-12 02:20:40 [logger.py:43] Received request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-2: prompt: '', params: PoolingParams(dimensions=None, use_cross_encoder=False, additional_metadata=None), prompt_token_ids: [214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 
6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-12 02:20:40 [logger.py:43] Received request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-3: prompt: '', params: PoolingParams(dimensions=None, use_cross_encoder=False, additional_metadata=None), prompt_token_ids: [6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 
7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 39215, 6892, 2408, 3034, 7986, 100, 7839, 12225, 9433, 214, 44622, 1363, 5, 2], prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 07-12 02:20:40 [engine.py:317] Added request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-0.
INFO 07-12 02:20:40 [engine.py:317] Added request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-1.
INFO 07-12 02:20:40 [engine.py:317] Added request embd-b02f362e260a4e218c570cc6ab1fb346-chunk-2.
  • client
# python ./examples/online_serving/openai_embedding_long_text_client.py
🚀 vLLM Long Text Embedding Client
📡 Connecting to: http://localhost:31090/v1
🤖 Model: multilingual-e5-large
🔑 API Key: ********-key
🧪 Testing vLLM Long Text Embedding with Chunked Processing
======================================================================

📝 Test 1: Short Text
Text length: 42 characters
✅ Success!
   - Embedding dimension: 1024
   - Processing time: 0.54s
   - Expected chunks: ~1
   - First 5 values: [0.01232257578521967, 0.009728744626045227, -0.014059314504265785, -0.03867439180612564, 0.037110574543476105]

📝 Test 2: Medium Text
Text length: 3200 characters
✅ Success!
   - Embedding dimension: 1024
   - Processing time: 0.04s
   - Expected chunks: ~1
   - First 5 values: [0.04108031839132309, -0.009568133391439915, -0.028527623042464256, -0.04032902047038078, 0.020682798698544502]

📝 Test 3: Long Text (2 chunks)
Text length: 27250 characters
✅ Success!
   - Embedding dimension: 1024
   - Processing time: 0.07s
   - Expected chunks: ~2
   - First 5 values: [0.04508449137210846, -0.017967931926250458, -0.014230169355869293, -0.03835897892713547, 0.003280746517702937]

📝 Test 4: Very Long Text (3+ chunks)
Text length: 88000 characters
✅ Success!
   - Embedding dimension: 1024
   - Processing time: 0.16s
   - Expected chunks: ~3
   - First 5 values: [0.03270554542541504, 0.0007968051359057426, -0.016265524551272392, -0.03590775281190872, -0.009043066762387753]

🔄 Testing Batch Embedding with Mixed Lengths
==================================================
✅ Batch processing successful!
   - Number of inputs: 4
   - Number of embeddings: 4
   - Total processing time: 0.08s
   - Average time per input: 0.02s
   - Input 1: 12 chars → 1024D embedding
   - Input 2: 860 chars → 1024D embedding
   - Input 3: 18 chars → 1024D embedding
   - Input 4: 20000 chars → 1024D embedding

🔍 Testing Embedding Consistency
========================================
   - Generated embedding 1
   - Generated embedding 2
   - Generated embedding 3
✅ Consistency test completed!
   - Cosine similarity between runs: 1.000000
   - Expected: ~1.0 (identical embeddings)
   - ✅ High consistency achieved!

======================================================================
🎉 All tests completed!

💡 Key Features Demonstrated:
   - ✅ Automatic chunked processing for long text
   - ✅ Seamless handling of mixed-length batches
   - ✅ Consistent embedding generation
   - ✅ Backward compatibility with short text

📚 For more information, see:
   - Documentation: https://docs.vllm.ai/en/latest/models/pooling_models.html
   - Chunked Processing Guide: openai_embedding_long_text.md
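The consistency test in the client output above reports the cosine similarity between embeddings from repeated runs. A minimal sketch of that comparison is shown below; the function name and the sample vectors are illustrative assumptions and are not taken from the example client script.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Illustrative vectors standing in for two runs over the same input text.
run_1 = [0.25, -0.5, 0.75]
run_2 = list(run_1)

similarity = cosine_similarity(run_1, run_2)
# Identical vectors give a similarity of 1.0, up to floating-point rounding.
print(math.isclose(similarity, 1.0))
```

A value close to 1.0 across runs indicates that chunking and aggregation are deterministic for a given input.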

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @x22x22, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to vLLM's embedding capabilities by adding support for chunked processing of long text inputs. This feature directly addresses issues like CUDA crashes and memory constraints when processing text longer than a model's max_model_len, allowing vLLM to generate embeddings for arbitrarily long documents without requiring changes to the underlying models. The solution is implemented at the serving layer, providing a configurable and robust mechanism for handling large inputs while maintaining embedding quality through smart aggregation.

Highlights

  • New Feature: Chunked Processing for Embedding Models: Introduced automatic chunked processing at the serving layer for embedding models. This enables vLLM to handle text inputs that exceed the model's maximum context length by splitting them into manageable chunks, processing each independently, and aggregating the results.
  • Configuration and Activation: The chunked processing feature is configurable via the PoolerConfig by setting enable_chunked_processing: true. It automatically detects when input exceeds max_model_len and triggers the chunking logic.
  • Intelligent Aggregation: Implemented a FastChat-style weighted averaging algorithm to combine embeddings from multiple chunks. This method uses token counts as weights, ensuring that longer chunks contribute proportionally more to the final aggregated embedding, preserving semantic quality.
  • Backward Compatibility and Model Support: The implementation maintains backward compatibility for short text inputs and requires zero modifications to existing model code. Initially, intfloat/multilingual-e5-large is explicitly supported, with an extensible architecture for other embedding models.
  • Documentation and Examples: Added comprehensive documentation detailing the feature, its configuration, how it works, performance characteristics, and limitations. New example scripts (server and client) are provided to demonstrate how to configure and utilize chunking processing for long text embeddings.
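
As a rough illustration of the chunking and token-weighted averaging described in the highlights above, here is a minimal sketch. It is not the code from this PR: the function names `split_into_chunks` and `weighted_average` are hypothetical, and embeddings are modeled as plain Python lists rather than tensors.

```python
from typing import List, Sequence


def split_into_chunks(token_ids: Sequence[int], max_len: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(token_ids[i:i + max_len])
            for i in range(0, len(token_ids), max_len)]


def weighted_average(chunk_embeddings: Sequence[Sequence[float]],
                     chunk_token_counts: Sequence[int]) -> List[float]:
    """Combine chunk embeddings, weighting each chunk by its token count."""
    total = sum(chunk_token_counts)
    if total == 0:
        raise ValueError("no tokens to aggregate")
    dim = len(chunk_embeddings[0])
    result = [0.0] * dim
    for emb, count in zip(chunk_embeddings, chunk_token_counts):
        for j in range(dim):
            # Longer chunks contribute proportionally more to the result.
            result[j] += emb[j] * count / total
    return result


# Hypothetical usage: 10 tokens with a 4-token limit give chunks of 4, 4, and 2.
chunks = split_into_chunks(list(range(10)), max_len=4)
combined = weighted_average([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

In this toy example the first chunk carries three of the four tokens, so it dominates the combined vector. Whether the real implementation re-normalizes the result afterwards is not shown here.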

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added documentation Improvements or additions to documentation frontend labels Jul 11, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature for handling long text embeddings by implementing chunked processing. The code is generally well-written, and the documentation and examples are thorough. I've identified a critical bug in the batch processing logic that could lead to incorrect results when multiple long prompts are sent in a single request. I've also provided several suggestions to improve code quality, maintainability, and performance. Once the critical issue is addressed, this will be a great addition to the project.

…g, and update relevant documentation and examples. New example scripts and service startup scripts are added to demonstrate how to configure and utilize chunking processing. Update the model configuration to support long-text processing and implement the chunking processing logic in the code.

Signed-off-by: x22x22 <wadeking@qq.com>
@x22x22 x22x22 force-pushed the feat/support-long-text-embedding branch from b5f245d to 5398bbd Compare July 11, 2025 19:01
x22x22 added 2 commits July 12, 2025 03:21
… with isort, and ensure the accuracy of docstrings.

Signed-off-by: x22x22 <wadeking@qq.com>
…ompts, and improve the implementation of chunk processing to ensure accuracy and efficiency when handling long texts. Meanwhile, relevant type annotations have been updated to enhance code readability and type safety.

Signed-off-by: x22x22 <wadeking@qq.com>
@x22x22 x22x22 changed the title [Core] Add chunked processing to handle long inputs in embedding models [Frontend] Add chunked processing to handle long inputs in embedding models Jul 11, 2025
x22x22 added 5 commits July 12, 2025 04:06
…ess of block IDs and fix the block ID conflicts in batch processing. Updated relevant examples to demonstrate the new features.

Signed-off-by: x22x22 <wadeking@qq.com>
…ess of block IDs and fix the block ID conflicts in batch processing. Updated relevant examples to demonstrate the new features.

Signed-off-by: x22x22 <wadeking@qq.com>
…f the "Slow Processing" section from 1 to 3 to ensure the accuracy and consistency of the list.

Signed-off-by: x22x22 <wadeking@qq.com>
…_CODE to enhance the flexibility of the model name, and use this variable to replace the hard-coded model name in the output information. Ensure that the configuration during service startup is more consistent and maintainable.

Signed-off-by: x22x22 <wadeking@qq.com>
…verify the uniqueness of block IDs and resolve the block ID conflict issues in batch processing. Meanwhile, relevant documents and examples have been updated to ensure the accuracy and consistency of long-text processing.

Signed-off-by: x22x22 <wadeking@qq.com>
@DarkLight1337
Member

cc @maxdebayser @22quinn @noooop

@noooop
Contributor

noooop commented Jul 12, 2025

In fact, embedding models are not very suitable for handling extremely long inputs, as too much content can lead to embeddings that are not able to effectively distinguish between similar content.

Here's a simple way to confirm that automatic chunked processing is working effectively:

See mteb_test_embed_models in vllm/tests/models/language/pooling/mteb_utils.py and https://github.com/noooop/snippet/blob/main/benchmarks/test_mteb/test_speed.py

Keeping only the very front part of long context, such as 2048 or even 512, is an extremely high baseline.
Refer to LongEmbed: Extending Embedding Models for Long Context Retrieval

However, it still suffers from biased distribution of key information, as demonstrated in Figure 2. With only 512 context length, E5Base achieves >85% nDCG scores on 3 out of 8 publicly available LoCo tasks.

Do the following three comparative experiments

  • max_model_len = 2048
  • max_model_len = 8102
  • max_model_len = 2048 + automatic chunked processing

If automatic chunked processing with multilingual-e5-large achieves comparable results on the mteb/T2Reranking dataset (or any test with contexts exceeding 8K), that indicates automatic chunked processing is effective.

@x22x22
Author

x22x22 commented Jul 12, 2025

@noooop I've manually tested using text chunks exceeding 1,000 tokens in vector databases, and confirmed that short user queries or task descriptions (~100 tokens) can successfully retrieve relevant text fragments.

While this verification isn't scientifically rigorous, it demonstrates a viable practical solution. I'll allocate time later to run the benchmark tests you recommended - appreciate the suggestion.

@noooop
Contributor

noooop commented Jul 13, 2025

@x22x22

After some investigation, intfloat/multilingual-e5-large uses the classic BERT architecture with a context length of 512, which appears very weak in 2025. Please perform a comparative test using jina-embeddings-v3, which has a maximum context length of 8192 and uses mean pooling.

Unless you use VLLM_ALLOW_LONG_MAX_MODEL_LEN or similar, you should not be allowed to set the context length of intfloat/multilingual-e5-large beyond 512, as it will exceed position_embeddings and cause an out-of-bounds error. This is not a bug. Please tone down or remove the content related to CUDA crashes.

@x22x22
Author

x22x22 commented Jul 13, 2025

@noooop
This enhancement specifically leverages VLLM_ALLOW_LONG_MAX_MODEL_LEN, and you can see the corresponding launch code in my test script here:
https://github.com/vllm-project/vllm/blob/da812672715ac5bb09a4e5e4acb1d6d2d59feca7/examples/online_serving/openai_embedding_long_text_service.sh

The purpose is to enable models like multilingual-e5-large to support longer contexts through sharding without modifying the model's original code. The same principle applies to other embedding models - for example, if you want jina-embeddings-v3 to support beyond its native 8192 context length, simply adjusting the MAX_MODEL_LEN parameter would achieve this.

While this approach may not deliver optimal embedding performance, it provides a practical low-cost solution for RAG scenarios requiring simultaneous processing of both short and long texts. Crucially, no performance penalty occurs when input stays within a model's native context limit (e.g. ≤512 for E5, ≤8192 for Jina), as no special chunking gets triggered.


Would you be open to continuing this discussion more efficiently via https://slack.vllm.ai? I've requested access to the Slack workspace but haven't received approval yet - perhaps we could connect there once I'm onboarded.

@noooop
Contributor

noooop commented Jul 13, 2025

I looked through the code carefully.

You can add a new parameter such as max_embed_len, but do not modify any code related to max_model_len; that will cause a huge number of bugs.

And do not use VLLM_ALLOW_LONG_MAX_MODEL_LEN.

I think we should remove VLLM_ALLOW_LONG_MAX_MODEL_LEN. I can’t think of any use case that would require this flag.

@x22x22
Author

x22x22 commented Jul 15, 2025

@x22x22 , thanks for running the benchmarks and providing nice graphs, I think they are a convincing argument in favor of context extension. But I'm still not sure that out of the context extension methods, this is the best one. I think we should try the GP, RP and PI methods of the LongEmbeds paper with e5-large. Because they manipulate the position ids, these methods will work with all pooling types and can run entirely on the GPU. In the paper they also show superior performance. Do you have the time to try them? It shouldn't be too hard to run a proof of concept, I think the only required changes are a modification of the position_ids in bert.py or roberta.py, setting --max-model-len 32768 and VLLM_ALLOW_LONG_MAX_MODEL_LEN=1.

Otherwise, if you're willing to share your benchmarking script I can try to run this experiment.

@maxdebayser

I've committed my modifications to https://github.com/x22x22/LongEmbed/tree/feature/add-openai-embedding-support, please pull the branch.

# You can skip installing flash-attn and other dependencies, as we mainly need mteb, openai>=1.0.0, and tiktoken
pip install -r requirements.txt

# Modify the BASE_URL and API_KEY in scripts/run_openai_long_embed.sh
/bin/bash scripts/run_openai_long_embed.sh

After evaluation is complete, the results will be output to ./results.

I submitted this PR hoping to find a universal context extension approach for embedding models without intrusive modifications to the model source code - that's the advantage of this method. Of course, I understand this doesn't conflict with the optimization methods mentioned in the LongEmbeds paper. We could also create another PR that targets different models by modifying their source code to extend embedding model context.

@maxdebayser
Contributor

@x22x22 , I implemented the RP, GP, and PI methods from the paper but couldn't get good results. For e5-multilingual-large, this is the typical output I'm getting with these context extension methods:

  "LEMBNeedleRetrieval": {
    "256": 0.74,
    "512": 0.72,
    "1024": 0.82,
    "2048": 0.4,
    "4096": 0.22,
    "8192": 0.02,
    "16384": 0.02,
    "32768": 0.0,
    "avg": 0.3675
  }

Up to 1024 and 2048 they seem to work and then the performance just drops. On the other hand I was able to reproduce your results with your branch, which is always good.

Can you also run the benchmark for models with CLS or LAST pooling?

@noooop
Contributor

noooop commented Jul 17, 2025

Can you also run the benchmark for models with CLS or LAST pooling?

e5-multilingual-large uses mean pooling

"pooling_mode_mean_tokens": true, in https://huggingface.co/intfloat/multilingual-e5-large/blob/main/1_Pooling/config.json

Model inference using a different pooling method than training has very poor results

The automatic chunked processing proposed in this PR is likely to work well on models that use mean pooling.

If you want to test whether this method is compatible with CLS pooling, BAAI/bge-m3 uses CLS pooling and natively supports an 8K context length.

@maxdebayser
Contributor

Thanks, @noooop, I meant adding tests results for other models with different pooling methods, not e5-large-multilingual as forcing another pooling method on that one would lead to poor results, as you pointed out.

@x22x22
Author

x22x22 commented Jul 17, 2025

@maxdebayser
I've been quite busy with work lately, so I might not have time to test until tomorrow. I'll get back to you with the results.

@x22x22
Author

x22x22 commented Jul 18, 2025

@maxdebayser

The evaluation results for CLS and LAST pooling are as follows:
https://x22x22.github.io/bge_m3_long_context_analysis.html
https://x22x22.github.io/qwen3_embedding_analysis.html

  1. According to the documentation and their open source repository https://github.com/FlagOpen/FlagEmbedding/blob/3bc1962480d13b793b5bda5d747e0b6cc66f73d9/examples/finetune/embedder/encoder_only/m3.sh#L50, BGE-M3 was trained using CLS pooling.

  2. Based on https://huggingface.co/Qwen/Qwen3-Embedding-0.6B/blob/main/1_Pooling/config.json, Qwen3-Embedding-0.6B appears to use LAST pooling.

Since Qwen3-Embedding-0.6B natively supports 32k length and the LongEmbed project's longest evaluation dataset is 32k, automatic chunking processing cannot be triggered if the length doesn't exceed 32k. Therefore, several extended length evaluation datasets were created based on the following logic:

**Core Method**: Generate various evaluation length combinations by merging documents of different lengths, including 32768+256=33024, 32768+512=33280, 32768+1024=33792, 32768+2048=34816, 32768+4096=36864, etc. The queries remain unchanged while only extending the target document lengths for retrieval.

**Why This Works**: LEMBNeedleRetrieval and LEMBPasskeyRetrieval are sentence-to-paragraph (s2p) tasks where queries are short sentences that need to be retrieved from long paragraphs. In such tasks, the queries are inherently fixed short texts, and what truly needs testing is the model's retrieval capability in longer documents. Therefore, document lengths can be safely extended without affecting the task's essence.

**Technical Implementation**: Uses the separator "--- Document Continuation ---" to connect documents of different lengths, updates document ID format to doc1_merged_doc2_target_length, and maintains correct correspondence between queries and new documents in qrels. Supports flexible generation of various length combinations from 33024 to 36864.

This method fully leverages the characteristics of s2p tasks: fixed queries and extensible documents, enabling progressive length extension evaluation that provides comprehensive testing for models' long document retrieval capabilities.

It's important to note that the evaluation dataset lengths in LongEmbed refer to string length, not token length. Therefore, only evaluation datasets of 36864 length and above truly contain content with token lengths exceeding 32k. Focus should be placed on evaluation scores for lengths greater than or equal to 36864.
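
The merging procedure described above could look roughly like the sketch below. This is an illustration only, not the script used for the evaluation: `merge_documents` and `remap_qrels` are hypothetical names, the newline placement around the separator is an assumption, and the ID layout is one reading of the doc1_merged_doc2_target_length format.

```python
from typing import Dict, Tuple

SEPARATOR = "--- Document Continuation ---"
BASE_LENGTH = 32768

# Target lengths listed in the description: 32768 plus each extension size.
TARGET_LENGTHS = [BASE_LENGTH + extra for extra in (256, 512, 1024, 2048, 4096)]


def merge_documents(doc1_id: str, doc1_text: str,
                    doc2_id: str, doc2_text: str,
                    target_length: int) -> Tuple[str, str]:
    """Join two documents with the separator and build the merged document ID."""
    merged_id = f"{doc1_id}_merged_{doc2_id}_{target_length}"
    merged_text = f"{doc1_text}\n{SEPARATOR}\n{doc2_text}"
    return merged_id, merged_text


def remap_qrels(qrels: Dict[str, Dict[str, int]],
                old_doc_id: str, new_doc_id: str) -> Dict[str, Dict[str, int]]:
    """Point every query that referenced old_doc_id at the merged document."""
    return {
        query_id: {(new_doc_id if doc_id == old_doc_id else doc_id): rel
                   for doc_id, rel in docs.items()}
        for query_id, docs in qrels.items()
    }


# Hypothetical usage: the query is unchanged, only its target document grows.
new_id, new_text = merge_documents("docA", "needle text", "docB", "filler text", 33024)
new_qrels = remap_qrels({"q1": {"docA": 1}}, "docA", new_id)
```

Because the lengths here count characters rather than tokens, a merged document only triggers chunking once its token count actually exceeds the model limit.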

@noooop
Contributor

noooop commented Jul 18, 2025

  • For models that support long context, such as jina-embeddings-v3, using a small max_model_len (e.g., 1k) + automatic chunked processing achieves similar results as max_model_len (e.g., 2k), but faster (needs data support).

In fact, I am more concerned about this data.

Please run the following:

  1. Use the correct pooling method within max_model_len:
    • jina-embeddings-v3 with mean pooling,
    • BAAI/bge-m3 with CLS pooling,
    • Qwen3-Embedding-0.6B with LAST pooling.
  2. Limit max_model_len to force automatic chunked processing: set max_model_len = 512.
  3. Use automatic chunked processing to extend up to 8K or 32K (equivalent to using mean pooling beyond max_model_len).
  4. Compare with the native long context:
    • whether the performance is comparable to the original context expansion capability,
    • and whether there is a significant improvement in speed.

I even suspect that the automatic chunked processing extension may perform better than the original context expansion in some scenarios, beyond just being faster (that is, more than a performance-versus-speed trade-off).

@x22x22
Author

x22x22 commented Jul 18, 2025

@noooop
You mean forcing all three models to use a length of 512 for automatic chunked processing? And then you need me to compare their recall metrics and throughput against when they don't use automatic chunked processing?

@noooop
Contributor

noooop commented Jul 18, 2025

I think automatic chunked processing is a method of context expansion for embedding models. Try setting max_model_len = 512 and max_embed_len = 8K.
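
To make the interplay of the two limits concrete, here is a small sketch of how a request might be routed. It is an assumption about the intended behavior, not the PR's code: `plan_request` and `num_chunks` are hypothetical helpers, and max_embed_len is the parameter name proposed earlier in this thread.

```python
def plan_request(num_tokens: int, max_model_len: int, max_embed_len: int) -> str:
    """Decide how an embedding request would be handled under the two limits."""
    if num_tokens <= max_model_len:
        return "native"    # fits in a single forward pass, no chunking
    if num_tokens <= max_embed_len:
        return "chunked"   # split into chunks of at most max_model_len tokens
    return "reject"        # longer than the extended limit


def num_chunks(num_tokens: int, max_model_len: int) -> int:
    """Number of chunks needed, using ceiling division."""
    return -(-num_tokens // max_model_len)


# With max_model_len = 512 and max_embed_len = 8K (8192 tokens):
short_plan = plan_request(300, 512, 8192)
long_plan = plan_request(4000, 512, 8192)
too_long_plan = plan_request(9000, 512, 8192)
```

Under these settings an input of exactly 8192 tokens would be split into 16 chunks of 512 tokens.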

@x22x22
Author

x22x22 commented Jul 18, 2025

@noooop

max_embed_len = 8K means we can only test datasets up to a maximum of 8k. So I only need to provide you with evaluation results for 512~8k, is that correct?

@noooop
Contributor

noooop commented Jul 18, 2025

jina-embeddings-v3 and BAAI/bge-m3 natively support only 8K.

The computational complexity of self-attention is O(n^2), so computing 16 blocks of 512 should be faster than computing one block of 8K.
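
The arithmetic behind this estimate can be checked with a few lines. This is a back-of-the-envelope sketch that counts only the quadratic attention term; `attention_cost` is a hypothetical helper, and the parts of the model that scale linearly with sequence length are ignored, so a real speedup would likely be smaller.

```python
def attention_cost(seq_len: int) -> int:
    """Relative cost of the quadratic self-attention term for one sequence."""
    return seq_len * seq_len


full_cost = attention_cost(8192)          # one block of 8K tokens
chunked_cost = 16 * attention_cost(512)   # 16 independent blocks of 512 tokens
ratio = full_cost // chunked_cost

print(f"one 8K block: {full_cost}, 16 x 512 blocks: {chunked_cost}, ratio: {ratio}")
```

In general, splitting n tokens into k equal chunks reduces the quadratic term from n^2 to n^2 / k, which here is a factor of 16.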


mergify bot commented Jul 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @x22x22.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 18, 2025
@maxdebayser
Contributor

@x22x22 , thanks for the result. This is pretty cool, it feels like we're writing a paper.

Here is a model with 512 context and CLS-based pooling: ibm-granite/granite-embedding-278m-multilingual. I haven't found one with LAST pooling yet. My interest in testing this is that I think that regardless of the pooling type of the underlying model, we should see an extension with MEAN-based chunking. Using the CLS or LAST strategy on the chunks doesn't seem to make sense to me because then we're just throwing away chunks. It would be the same as just truncating the input left or right and sending a single chunk of tokens through the model.
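
A toy sketch of the three cross-chunk aggregation strategies shows why CLS and LAST amount to truncation. It assumes that CLS keeps only the first chunk's embedding and LAST keeps only the last chunk's; the function `aggregate_chunks` is hypothetical and is not taken from this PR.

```python
from typing import List, Sequence


def aggregate_chunks(chunk_embeddings: Sequence[Sequence[float]],
                     chunk_token_counts: Sequence[int],
                     strategy: str) -> List[float]:
    """Combine per-chunk embeddings into one vector using the given strategy."""
    if strategy == "MEAN":
        # Token-weighted average: every chunk contributes.
        total = sum(chunk_token_counts)
        dim = len(chunk_embeddings[0])
        return [
            sum(emb[j] * n for emb, n in zip(chunk_embeddings, chunk_token_counts)) / total
            for j in range(dim)
        ]
    if strategy == "CLS":
        # Only the first chunk survives; later chunks are discarded.
        return list(chunk_embeddings[0])
    if strategy == "LAST":
        # Only the final chunk survives; earlier chunks are discarded.
        return list(chunk_embeddings[-1])
    raise ValueError(f"unknown strategy: {strategy}")


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_counts = [2, 2, 4]
mean_vec = aggregate_chunks(embeddings, token_counts, "MEAN")
cls_vec = aggregate_chunks(embeddings, token_counts, "CLS")
last_vec = aggregate_chunks(embeddings, token_counts, "LAST")
```

Only the MEAN result depends on all three chunks. The CLS and LAST results would be unchanged if the other chunks were dropped from the input entirely.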

x22x22 added 2 commits July 18, 2025 22:22
…rocessing function, and elaborate on the processing methods and performance characteristics of different pooling types (MEAN, CLS, LAST). Optimize the configuration parameters to ensure that users receive clear warning messages when using non-MEAN pooling, and enhance the support for long-text input.

Signed-off-by: x22x22 <wadeking@qq.com>
@mergify mergify bot removed the needs-rebase label Jul 18, 2025
@x22x22
Author

x22x22 commented Jul 18, 2025

@noooop @maxdebayser

https://x22x22.github.io/embedding_models_comprehensive_evaluation.html
https://x22x22.github.io/evaluation_results.md

@x22x22
Author

x22x22 commented Jul 18, 2025

@maxdebayser buildkite/fastcheck/pr/docker-build-image has failed to run. It seems that the operation timed out. Could you assist in rerunning it?

… ensure consistency between the task and the model configuration. In case the validation fails, return an error response.

Signed-off-by: x22x22 <wadeking@qq.com>
@x22x22
Author

x22x22 commented Jul 21, 2025

@maxdebayser

My code has been submitted and all CI tests have passed. What else needs to be done to merge my PR? Thanks.

@noooop
Contributor

noooop commented Jul 21, 2025

Thank you for the thorough testing; Native does indeed perform better.

@x22x22
Author

x22x22 commented Jul 21, 2025

@noooop

Yes, when staying within the model's native length, the automatic chunked processing mechanism shows no advantage in terms of recall metrics alone.

However, looking back at cases where the input exceeds the native length by 1-2 times, there is indeed an advantage in recall rates. This improvement appears almost exclusively in embedding models that use MEAN pooling strategies.

[Screenshot: PixPin_2025-07-21_11-37-53]

Therefore, I still recommend not modifying max_length to trigger forced automatic chunked processing.

@maxdebayser
Contributor

maxdebayser commented Jul 21, 2025

@x22x22 , thanks for the comprehensive tests. Actually I think that we shouldn't touch the native pooling type while testing, because changing from MEAN to CLS for example is going to perform poorly.

Here are three models that we could test that have the 3 pooling types and shortish contexts:

LAST: BAAI/bge-multilingual-gemma2, context 8k
MEAN: intfloat/multilingual-e5-large, context 514
CLS: ibm-granite/granite-embedding-278m-multilingual, context 514

If we could test each model with the 3 different aggregations I think that would show that mean aggregation always performs better (without changing native pooling type).

My code has been submitted and all CI tests have passed. What else needs to be done to merge my PR? Thanks.

Let's say the results show that mean aggregation always performs better. In that case, the code can be simplified a lot.

@x22x22
Author

x22x22 commented Jul 21, 2025

Hi @maxdebayser ,

Thank you for the detailed feedback! I want to make sure I understand your suggestions correctly:

For Testing:
You're suggesting I test 3 models with different native pooling types:

  • LAST: BAAI/bge-multilingual-gemma2 (8k context)
  • MEAN: intfloat/multilingual-e5-large (514 context)
  • CLS: ibm-granite/granite-embedding-278m-multilingual (514 context)

And for each model, test all 3 aggregation strategies (MEAN, LAST, CLS) to compare their performance, while keeping each model's native pooling type unchanged.
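
As a rough sketch of the test matrix described above, the 3 models and 3 aggregation strategies give 9 combinations. The dictionary layout and the `build_test_grid` helper are hypothetical names for illustration, not part of this PR; only the model names and pooling types come from the thread.

```python
from itertools import product

# Models and their native pooling types, as listed in the thread.
MODELS = {
    "BAAI/bge-multilingual-gemma2": "LAST",
    "intfloat/multilingual-e5-large": "MEAN",
    "ibm-granite/granite-embedding-278m-multilingual": "CLS",
}
AGGREGATIONS = ("MEAN", "LAST", "CLS")


def build_test_grid():
    """Return one entry per (model, aggregation) pair.

    The native pooling type is carried along unchanged; only the
    cross-chunk aggregation strategy varies between entries.
    """
    return [
        {"model": model, "native_pooling": pooling, "aggregation": agg}
        for (model, pooling), agg in product(MODELS.items(), AGGREGATIONS)
    ]


grid = build_test_grid()
```

Each entry would then be evaluated on the same benchmark so the aggregation strategies can be compared per model.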

For Code Simplification:
If the test results show that MEAN aggregation consistently performs best across all models/pooling types, then I should simplify the code by:

  • Removing the special handling for LAST and CLS pooling types in chunked processing
  • Always using MEAN aggregation (weighted averaging) regardless of the model's native pooling type
  • This would eliminate the complex branching logic and make the codebase much cleaner
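
To make the comparison concrete, here is a minimal pure-Python sketch of the three cross-chunk aggregation strategies. It assumes that MEAN weights each chunk embedding by its token count, that CLS keeps the first chunk's embedding, and that LAST keeps the last chunk's embedding. The function name and signature are hypothetical and do not reflect the actual code in this PR.

```python
def aggregate_chunks(chunk_embeddings, chunk_token_counts, strategy="MEAN"):
    """Combine per-chunk embeddings into one vector (illustrative only)."""
    if not chunk_embeddings:
        raise ValueError("at least one chunk embedding is required")
    if len(chunk_embeddings) != len(chunk_token_counts):
        raise ValueError("one token count is required per chunk")

    if strategy == "MEAN":
        # Weighted average: longer chunks contribute proportionally more.
        total = sum(chunk_token_counts)
        dim = len(chunk_embeddings[0])
        return [
            sum(emb[i] * n for emb, n in zip(chunk_embeddings, chunk_token_counts)) / total
            for i in range(dim)
        ]
    if strategy == "CLS":
        # Assumption: the first chunk holds the CLS token.
        return list(chunk_embeddings[0])
    if strategy == "LAST":
        # Assumption: the last chunk holds the final token.
        return list(chunk_embeddings[-1])
    raise ValueError(f"unknown aggregation strategy: {strategy}")


result = aggregate_chunks([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

If MEAN turns out to be best everywhere, only the first branch would need to remain.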

Next Steps:
Is conducting these tests the main remaining requirement for merging this PR? Or are there other aspects I should address?

Please let me know if my understanding is correct, and I'll proceed with implementing the tests accordingly.

Thanks!

@noooop noooop left a comment

I hope this work is accepted.

So next,

  • we should organize the test results and inform users which approaches are best practice.
  • Those combinations that perform poorly should be avoided,
  • and those with potential issues need to be restricted at the code level.

LongEmbed is likely the most challenging benchmark. In other tests, the method performs notably better.

x22x22 commented Jul 21, 2025

@noooop

Thank you for the approval! I have a few questions about the next steps you mentioned:

1. Best Practices Documentation:
You mentioned organizing test results and informing users about best practices. I can create a markdown file to document this, but should it be included in this current PR or can I submit it as a separate PR? I'm concerned that keeping this PR open too long might lead to merge conflicts with the main branch.

2. Code-level Restrictions:
Regarding adding more restrictions at the code level, I have two points:

  • Pooling Strategy Detection: Each model's pooling strategy is determined during training and typically isn't reflected in the model files, making it difficult to automatically detect the pooling strategy through code. So for pooling strategy choices, we can only provide helpful hints/warnings (as currently implemented).

  • max_model_len Restriction: From our testing, we found that forcibly setting max_model_len smaller than the model's native length produces almost entirely negative results. We could add a restriction to disallow setting max_model_len when chunked processing is enabled. However, would this modification also be acceptable to submit as a separate PR?
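
A possible shape for the max_model_len restriction, as a sketch only: the function name, parameters, and error text below are hypothetical and are not taken from vLLM or from this PR.

```python
def resolve_max_len(native_max_len, user_max_model_len=None, chunked=False):
    """Return the effective max length, rejecting overrides under chunking.

    Hypothetical helper: with chunked processing enabled, any explicit
    max_model_len override is refused so the model keeps its native length.
    """
    if chunked and user_max_model_len is not None:
        raise ValueError(
            "max_model_len cannot be overridden when chunked processing is enabled"
        )
    return native_max_len if user_max_model_len is None else user_max_model_len


effective = resolve_max_len(514, chunked=True)
```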

My preference would be to:

  • Keep the current PR focused on the core chunked processing implementation
  • Submit follow-up PRs for documentation and additional restrictions

This approach would help avoid merge conflicts and make the review process more manageable. What are your thoughts on this approach?

else:
    # Fall back to max_model_len validation (original behavior)
    effective_max_len = self.max_model_len
    validation_error_msg = (
The error message is the same in both branches; only the variables change. You can define the message once as a template and fill it in with str.format() instead of repeating f-strings.
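
For illustration, a shared template could look like the sketch below. The message wording, `ERROR_TEMPLATE`, and `build_length_error` are hypothetical and are not the actual text or names used in this PR.

```python
# One template shared by both validation branches (hypothetical wording).
ERROR_TEMPLATE = (
    "Input has {token_count} tokens, which exceeds the {limit_name} "
    "of {limit} tokens. Please shorten the input."
)


def build_length_error(token_count, limit, limit_name="max_model_len"):
    """Fill the shared template; only the variables differ per branch."""
    return ERROR_TEMPLATE.format(
        token_count=token_count, limit_name=limit_name, limit=limit
    )


message = build_length_error(600, 514)
```

Each branch then only chooses the limit and its name, and the message text lives in one place.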

@maxdebayser
Is conducting these tests the main remaining requirement for merging this PR? Or are there other aspects I should address?

These tests will help us understand which parts of the code are actually needed. Once we know that I think there will be opportunities for refactoring and reducing the number of lines of code by a good amount.

Labels: documentation, frontend